Fast and accurate interpretation of workload classification model

How can we interpret predictions of a workload classification model? A workload is a sequence of operations executed in DRAM, where each operation contains a command and an address. Classifying a given sequence into a correct workload type is important for verifying the quality of DRAM. Although a previous model achieves a reasonable accuracy on workload classification, it is challenging to interpret the prediction results since it is a black box model. A promising direction is to exploit interpretation models which compute the amount of attribution each feature gives to the prediction. However, none of the existing interpretable models are tailored for workload classification. The main challenges to be addressed are to 1) provide interpretable features for further improving interpretability, 2) measure the similarity of features for constructing the interpretable super features, and 3) provide consistent interpretations over all instances. In this paper, we propose INFO (INterpretable model For wOrkload classification), a model-agnostic interpretable model which analyzes workload classification results. INFO provides interpretable results while producing accurate predictions. We design super features to enhance interpretability by hierarchically clustering original features used for the classifier. To generate the super features, we define and measure the interpretability-friendly similarity, a variant of Jaccard similarity between original features. Then, INFO globally explains the workload classification model by generalizing super features over all instances. Experiments show that INFO provides intuitive interpretations which are faithful to the original non-interpretable model. INFO also shows up to 2.0× faster running time than the competitor while having comparable accuracies for real-world workload datasets.


Introduction
How can we provide accurate and fast interpretations for a workload classification model? A workload classification task is to classify a given subsequence, which contains a series of tuples of commands and locations for memory accesses, into a workload type that generates the sequence. As new AI-based applications including self-driving cars and mobile applications emerge, the demands for memory devices like DRAM (Dynamic Random Access Memory) rapidly grow. Since the manufacturers require high-quality memory devices, verifying and improving the quality of DRAM is a crucial task in the real world. An accurate workload classification model helps improve the quality of DRAM. In utilizing a workload classifier, it greatly helps to know why a workload subsequence is classified to a specific class, or which part leads to the misclassification. There are 5 heterogeneous fields within an operation: command, rank, bank group, bank, and address. The previous state-of-the-art model Acorn [1] generates 3 types of features using the 5 fields: CMD, bank-level, and cell-level features. CMD features capture sequential information of the command field. Bank-level and cell-level features are related to address-related fields, i.e., rank, bank group, bank, and address fields. Bank-level feature vectors describe the number of accesses for each bank, where the address field refers to a row or column address of a specific bank. Cell-level features map an address into a memory region inside banks, to capture spatial information within banks.
An accurate but black-box workload classification model fails to provide an interpretation for the prediction of a given subsequence. For example, it does not explain which features mainly affect the classification result, or why the subsequence is classified into a class. Therefore, we need to exploit an interpretable model to understand the results of subsequence classification. Ribeiro et al. [2] propose LIME, which provides an interpretation for an instance. Since the features used by a neural network are not understandable to humans, LIME interprets the prediction for an image using super-pixels obtained by image segmentation. However, LIME cannot give explanations for a workload classifier because super-pixels are limited to image interpretation. Thus, we need to design appropriate super features for workload classification. Another limitation of LIME is its local interpretability: it needs to learn a new model to explain each new instance. Such a locally interpretable model entails a long running time, which limits its use for workload classification.
In this paper, we propose INFO (INterpretable model For wOrkload classification), a fast and accurate interpretable model for workload classification. We carefully design super features for workload classification. For effective clustering, we propose an interpretability-friendly similarity measure between the original features based on a variant of Jaccard similarity. We also exploit the locality of the bank-level and cell-level features. INFO provides global explanations using super features commonly defined for every instance. Since we use the same super features over all instances, we provide consistent interpretations such as analyzing what super features mainly affect the predictions for subsequences in the same class. To the best of our knowledge, the proposed method is the first work that explains the classification results of workloads. Fig 1 shows an example of interpretations given by INFO for two instances of the SEC-seq dataset. We summarize our main contributions as follows:
• Method. We propose INFO, a globally explainable model which gives fast and accurate interpretations for workload classification.
• Interpretation. We enhance the interpretability of the proposed method by clustering the original features based on the hierarchical characteristics of workload data.
• Experiments. We show the interpretation results and demonstrate the effectiveness of the proposed model. INFO achieves up to 2.0× faster running time than the competitor while having a comparable accuracy on real-world workload data.
The rest of the paper is organized as follows. We provide preliminaries, our proposed method, and experimental results in order. Then, we present related works and the conclusion. The code and data used for INFO are available at https://github.com/snudatalab/INFO.

Preliminaries
In this section, we describe the preliminaries and our problem definition. Table 1 describes the symbols used in this paper.

Workload sequence
We define a workload sequence produced by a DRAM controller unit during a program execution referring to [1].
Definition 1 (Workload Sequence). A workload sequence $W \in \mathbb{R}^{l \times 5}$ is a matrix where $l$ is the length of a workload and 5 is the number of fields.
• Command field contains a series of commands where each command is mapped to a number, e.g., 1 refers to ACT command.
• Rank field is an index of a rank, the highest level of DRAM components.
• Bank group field refers to the bank group number within a rank.
• Bank field contains an index of a bank in a bank group.
• Address field represents a row or column address within a bank.
We define the concept of workload subsequence referring to [1] and use it as an instance of the proposed method.
Definition 2 (Workload Subsequence). A workload subsequence $w \in \mathbb{R}^{100{,}000 \times 5}$ is a submatrix of W whose row length is fixed at 100,000.
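Definitions 1 and 2 can be illustrated with a minimal sketch that cuts a workload matrix into fixed-length subsequences. The function name `split_into_subsequences`, the use of non-overlapping windows, and the dropping of a trailing remainder are assumptions made for illustration; the text only fixes the row length at 100,000.

```python
import numpy as np

SUBSEQ_LEN = 100_000  # row length fixed by Definition 2
NUM_FIELDS = 5        # command, rank, bank group, bank, address


def split_into_subsequences(W, subseq_len=SUBSEQ_LEN):
    """Cut a workload matrix W (l x 5) into subsequences of subseq_len rows.

    Assumption: windows do not overlap and a trailing remainder shorter
    than subseq_len is dropped.
    """
    if W.ndim != 2 or W.shape[1] != NUM_FIELDS:
        raise ValueError("expected a matrix with 5 columns")
    num_windows = W.shape[0] // subseq_len
    return [W[i * subseq_len:(i + 1) * subseq_len] for i in range(num_windows)]


# Toy workload with 250,000 operations and random field values.
rng = np.random.default_rng(0)
W = rng.integers(0, 8, size=(250_000, NUM_FIELDS))
subsequences = split_into_subsequences(W)
```

With this toy input, two full subsequences are produced and the last 50,000 operations are discarded.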

Workload classification
The task of workload classification aims to classify a given subsequence into a workload type that generates it. Each workload contains a sequence of operations where each operation contains 5 heterogeneous fields. The previous state-of-the-art method Acorn [1] performs a workload classification task with carefully designed features including CMD feature, bank-level feature, and cell-level feature. The final feature vector for classification is a concatenation of CMD, bank-level, and cell-level feature vectors. Acorn predicts the label of a subsequence by feeding its feature vector into a neural network. Although Acorn achieves high accuracy for the workload classification, it does not give an easy interpretation of its result because it is a complicated deep learning model. In this work, we additionally provide interpretability for the workload classification model.
LIME [2] is a model-agnostic interpretable model which is locally faithful to the pre-trained model. Given an instance and a trained model, LIME aims to interpret the prediction of a given instance. For an image classification task, LIME exploits super-pixels which are obtained through image segmentation where each segment corresponds to a super-pixel. LIME trains a linear model by sampling instances in the vicinity of a given instance. Note that features used for a linear model are super-pixels. The trained weights of a linear model correspond to the contributions super-pixels give to the prediction. LIME is not ready for workload classification for the following two reasons. First, it is not clear how to construct the interpretable features for workload classification. Second, LIME provides local interpretation which gives inconsistent interpretations when we compare the results of many instances. We address the two challenges with the proposed INFO.

Problem definition
We introduce the formal problem definition as follows: Problem 1 (Interpretation for Workload Classification). Given a subsequence and a trained model that classifies the subsequence, find an interpretable model over the subsequence, which is faithful to the trained model.

Proposed method
In this section, we propose INFO for interpreting predictions of a workload classification model. The challenges to be addressed are to 1) provide intuitive and interpretable features, 2) measure the similarity of raw features for constructing the interpretable features, and 3) provide consistent interpretations over all instances. To tackle the above challenges, our ideas are to 1) carefully design super features for workloads, 2) compute the interpretability-friendly similarity, and 3) provide consistent interpretations with global super features over all instances.
Algorithm 1 shows the overall process of INFO. Before interpreting a given test instance, we find super features for three types of feature vectors: CMD, bank-level, and cell-level features. Given a test instance, a set of super features, and a trained classification model f, we first construct a binary vector of super features (line 1 in Algorithm 1). Then, we construct a dataset based on the binary vector and the test instance (line 2 in Algorithm 1), and learn an interpretable model that approximates the given classification model f (line 3 in Algorithm 1). Finally, we interpret the classification results using the weights of the interpretable model, which represent the amount of attribution of super features.
2: Data Construction. Randomly sample $z' \in \mathbb{R}^{d'}$ from $x'$. Construct a dataset Z for the model g as pairs of $(z', f(z))$ where $z \in \mathbb{R}^{d}$ is a vector that recovers $z'$ to the original feature space. If the i-th element of $z'$ is 1, z recovers the values of features within the i-th super feature. Otherwise, the values remain 0.
3: Training the model. Train the model g using the training set Z by minimizing $\mathcal{L}(f, g, \pi_x)$ in the Proposed Method section, which makes g approximate the model f. Return the weights of the model g, which are the attribution values each super feature gives to the classification result of the given instance.
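The data construction step can be sketched as follows, under stated assumptions: each super feature is represented as a list of original feature indices (`groups`), binary vectors are sampled by switching each present super feature on or off with probability 0.5, and `toy_f` is a hypothetical stand-in for the trained classifier f. The grouping of the six features into two super features is also illustrative.

```python
import numpy as np


def recover(z_prime, x, groups):
    """Map a binary super-feature vector z' back to the original feature space.

    Features inside a super feature whose bit is 1 keep their values from x;
    all other features stay 0.
    """
    z = np.zeros_like(x)
    for i, keep in enumerate(z_prime):
        if keep:
            z[groups[i]] = x[groups[i]]
    return z


def build_dataset(x, x_prime, groups, f, num_samples, rng):
    """Sample binary vectors from x' and label each one with f(z)."""
    # Assumption: each super feature is kept or dropped with probability 0.5.
    mask = rng.integers(0, 2, size=(num_samples, len(x_prime)))
    Z_prime = mask * x_prime  # a super feature absent from x' stays 0
    Z = np.stack([recover(zp, x, groups) for zp in Z_prime])
    labels = np.array([f(z) for z in Z])
    return Z_prime, Z, labels


# Toy instance: six n-gram counts grouped into two hypothetical super features.
x = np.array([11, 0, 30, 2, 0, 10], dtype=float)
groups = [[0, 1, 2], [3, 4, 5]]
x_prime = np.array([1, 1])
toy_f = lambda z: float(z.sum())  # hypothetical stand-in for the classifier f
rng = np.random.default_rng(0)
Z_prime, Z, labels = build_dataset(x, x_prime, groups, toy_f, 4, rng)
```

In practice f would return the class probability of the trained workload classifier for the recovered vector z.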
In the following, we first describe how to design super features by clustering raw features used in a workload classification model. We then propose INFO which gives global interpretations using the same super features over all instances.

Construction of super features
Our objective is to find interpretable super features which reflect original features used in a classification model. Directly using the original features to a local interpretable model degrades the interpretability due to their large number and insufficient information. For example, in an image classification task, interpretation with pixels corresponding to features is more difficult than that with super-pixels which are segmentation results of the image since the super-pixels are more intuitive than the pixels. To improve the interpretability in the workload classification task, we exploit super features corresponding to the super-pixels in images. Our approach is to cluster original features by exploiting their characteristics, and use the clusters as super features.
To obtain super features, we independently perform clustering for the three types of feature vectors in workload classification: CMD, bank-level, and cell-level features. CMD features contain sequential information within a command field. Bank-level and cell-level features are extracted from address-related fields, i.e., rank, bank group, bank, and address fields, capturing spatial information within them. Fig 2 summarizes the process of generating super features for workload classification.
CMD super features. Our goal is to find super features from n-gram CMD features. Although the most frequent n-gram sequences are selected as features for workload classification, they still have redundant sequential patterns. For example, n-gram sequences 33113 and 13133 include common sequences 33 and 13. Therefore, we need to construct super features of CMD features by clustering similar n-gram sequences, and use them for interpreting a prediction result of workload classification. To achieve this, we propose an interpretability-friendly similarity to measure the similarity between n-gram sequences.
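As a rough illustration of how frequent n-gram CMD features might be collected before clustering, the sketch below counts contiguous n-grams of command codes and keeps the most frequent ones. The helper names and the toy command sequence are hypothetical, and the selection procedure used by the classifier may differ in detail.

```python
from collections import Counter


def count_ngrams(commands, n):
    """Count every contiguous n-gram in a list of integer command codes."""
    return Counter(
        tuple(commands[i:i + n]) for i in range(len(commands) - n + 1)
    )


def top_k_ngrams(commands, n, k):
    """Return the k most frequent n-grams, most frequent first."""
    return [gram for gram, _ in count_ngrams(commands, n).most_common(k)]


# Toy command field; the codes 1 and 3 are illustrative only.
commands = [1, 3, 1, 3, 1, 3, 3, 3]
counts = count_ngrams(commands, n=2)
most_frequent = top_k_ngrams(commands, n=2, k=1)
```

Each selected n-gram then becomes one CMD feature whose value is its number of occurrences in a subsequence.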
Before clustering n-gram sequences, we define an interpretability-friendly similarity to group the CMD features. We propose a variant of Jaccard similarity to measure the similarity between two n-gram sequences. Given two n-gram sequences $s_1$ and $s_2$, we define their similarity $J(s_1, s_2)$ in Eq (1), where $l_{s_1}$ is the length of a sequence $s_1$ and $l_{s_1, s_2}$ is the length of the common subsequence between the front of $s_1$ and the end of $s_2$. Fig 3 gives an example of generating CMD super features by calculating the similarity. After measuring the similarity for all n-gram sequences, the next step is to perform hierarchical clustering based on the similarity. Treating each n-gram sequence as its own cluster, we identify the most similar pair of clusters and fuse these two clusters. The similarity between two clusters is measured using the average similarity between the elements in the two clusters. Then, we repeat the previous process for the remaining clusters until the desired number of clusters remains. Finally, we use the clustering results as CMD super features.
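The clustering procedure can be sketched in a few lines. The similarity used here is an assumed Jaccard-style form, overlap / ($l_{s_1}$ + $l_{s_2}$ − overlap), where overlap is the larger of the two front/end overlaps; this specific form is an illustration and should not be read as the paper's exact definition. The merge loop follows the description above: start from singleton clusters and repeatedly fuse the pair with the highest average similarity.

```python
from itertools import combinations


def overlap(s1, s2):
    """Longest k such that the first k symbols of s1 equal the last k of s2."""
    for k in range(min(len(s1), len(s2)), 0, -1):
        if s1[:k] == s2[-k:]:
            return k
    return 0


def similarity(s1, s2):
    """Assumed Jaccard-style similarity: shared overlap over combined length."""
    o = max(overlap(s1, s2), overlap(s2, s1))
    return o / (len(s1) + len(s2) - o)


def average_similarity(c1, c2):
    """Average pairwise similarity between the elements of two clusters."""
    return sum(similarity(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster(sequences, num_clusters):
    """Agglomerative clustering: fuse the most similar pair until done."""
    clusters = [[s] for s in sequences]
    while len(clusters) > num_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


ngrams = ["113", "131", "111", "555", "535", "553"]
super_features = cluster(ngrams, num_clusters=2)
```

This naive loop recomputes cluster similarities at every merge, which is acceptable for a few hundred n-grams; a library routine for average-linkage clustering would be the natural choice at larger scale.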
Bank-level super features. Our goal is to obtain super features from bank-level features which include rank, bank group, and bank fields. As discussed in the Preliminaries section, the rank field is at the highest level among the bank-level features. Therefore, we use the rank as a bank-level super feature based on the hierarchical structure. Fig 4(a) shows an example of DRAM with one rank, one bank group within a rank, and four banks within a bank group. In this example, the four banks correspond to one super feature, Rank 0.
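A small sketch of the rank-based grouping follows. It assumes banks are numbered in a flat, rank-major order (all banks of rank 0 first, then rank 1, and so on); that numbering and the helper names are assumptions for illustration.

```python
def rank_super_features(num_ranks, groups_per_rank, banks_per_group):
    """Map each rank to the flat indices of the banks it contains.

    Assumption: banks are numbered rank-major, i.e. every bank of rank 0
    comes before every bank of rank 1.
    """
    banks_per_rank = groups_per_rank * banks_per_group
    return {
        rank: list(range(rank * banks_per_rank, (rank + 1) * banks_per_rank))
        for rank in range(num_ranks)
    }


def rank_presence(bank_access_counts, groups):
    """Binary vector: 1 if any bank of the rank was accessed, else 0."""
    return [
        int(any(bank_access_counts[b] > 0 for b in banks))
        for _, banks in sorted(groups.items())
    ]


# One rank, one bank group, four banks: the four banks form Rank 0.
single_rank = rank_super_features(1, 1, 4)

# Two ranks with 4 bank groups of 4 banks each; only bank 20 is accessed.
two_ranks = rank_super_features(2, 4, 4)
counts = [0] * 32
counts[20] = 7
presence = rank_presence(counts, two_ranks)
```

Here bank 20 belongs to the second rank, so only that rank's super feature is marked as present.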
Cell-level super features. Our goal is to find super features for cell-level features. As for cell-level super features, we group adjacent memory blocks by considering the locality within a bank. We define each group of memory blocks as a cell-level super feature. Fig 4(b) illustrates an example of cell-level super features in Bank 1 of size 4 × 4. There are four memory blocks of size 2 × 2 colored blue. Then, we cluster two adjacent memory blocks and define a cell-level super feature of size 2 × 4 colored red.
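The grouping in this example can be written as two index computations. The sketch assumes blocks are numbered row by row within the bank and that horizontally adjacent blocks are the ones merged, which yields the 2 × 4 super feature described above; both choices are assumptions for illustration.

```python
def block_id(row, col, block_rows=2, block_cols=2, bank_cols=4):
    """Index of the memory block containing cell (row, col) of a bank.

    Blocks are numbered row by row; defaults match a 4 x 4 bank split
    into 2 x 2 blocks.
    """
    blocks_per_row = bank_cols // block_cols
    return (row // block_rows) * blocks_per_row + (col // block_cols)


def super_feature_id(block, blocks_per_super=2):
    """Group consecutive (horizontally adjacent) blocks into a super feature."""
    return block // blocks_per_super


# Collect the cells of the 4 x 4 bank that fall into super feature 0.
cells_in_first = [
    (r, c)
    for r in range(4)
    for c in range(4)
    if super_feature_id(block_id(r, c)) == 0
]
```

The cells collected for super feature 0 span rows 0 to 1 and all four columns, that is, a 2 × 4 region.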

Locally interpretable model for workload classification
We aim to propose an interpretable model for workload classification utilizing super features.
As a naive approach, we can adopt a previous explanation model called LIME [2] directly to the workload classification model which provides interpretation over an instance. When applied to image classification task, LIME generates super-pixels for each image through image segmentation and provides interpretation over an image. LIME can be used to explain the workload classification result of an instance using super features.
Given a feature vector of a subsequence, we need to convert it into a vector with super features for interpretation. We cluster n-gram sequences appearing in a given instance to generate CMD super features. Bank-level and cell-level super features are ranks and collections of memory blocks, respectively. We define a feature vector $x' \in \mathbb{R}^{d'}$ for an interpretable model where $d'$ is the total number of super features. The values 1 and 0 in $x'$ indicate the presence and absence of the corresponding super feature in the given subsequence, respectively. For example, if a subsequence has accessed the first rank, the corresponding element of $x'$ is 1. We then train a linear model g, which is interpretable in itself. We generate a training dataset Z for the linear model by randomly sampling binary vectors $z' \in \mathbb{R}^{d'}$ from $x'$. The label of $z'$ is the prediction of the pre-trained model f, denoted as $f(z)$, where $z \in \mathbb{R}^{d}$ is the vector recovered from $z'$ and d equals the total number of features for workload classification. Fig 5 illustrates the process of constructing training data for an interpretable model g, using n-gram features. Here, x is a feature vector for a workload classification model f, where each element of x denotes the number of occurrences of the corresponding n-gram sequence in a given instance. To be specific, 11, 0, 30, 2, 0, and 10 are the numbers of occurrences of sequences 113, 131, 111, 555, 535, and 553 in a given instance, respectively. We define a feature vector $x'$ for an interpretable model g using CMD super features. Since all super features have appeared in the given subsequence, all elements of $x'$ are 1. Two binary vectors $z'_1$ and $z'_2$ are randomly sampled from $x'$ and used to train the model g. In this example, two pairs of $(z', f(z))$ are set as Z. The model g is trained to approximate the pre-trained model f by minimizing Eq (2).
$$\mathcal{L}(f, g, \pi_x) = \sum_{z, z' \in Z} \pi_x(z) \left( f(z) - g(z') \right)^2 \qquad (2)$$
Here, $\pi_x(z) = \exp(-D(x, z)^2 / \sigma^2)$ measures the similarity between x and z, where $D(\cdot)$ denotes a distance function and $\sigma$ is a kernel width. We provide explanations using the weights of the linear model, which indicate the amount of attribution each super feature gives to the prediction.
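Minimizing Eq (2) for a linear g is a weighted least-squares problem, which the sketch below solves in closed form. It assumes a Euclidean distance for D, a linear model with a bias term, and no regularization; these choices and the function names are illustrative and are not taken from the released code.

```python
import numpy as np


def kernel_weights(x, Z, sigma):
    """pi_x(z) = exp(-D(x, z)^2 / sigma^2), with D assumed Euclidean."""
    dist = np.linalg.norm(Z - x, axis=1)
    return np.exp(-(dist ** 2) / sigma ** 2)


def fit_linear_surrogate(Z_prime, targets, weights):
    """Minimize sum_i weights_i * (targets_i - g(z'_i))^2 for linear g.

    Returns (coefficients, bias); the coefficients are the attributions
    of the super features.
    """
    A = np.hstack([Z_prime, np.ones((len(Z_prime), 1))])  # bias column
    sw = np.sqrt(weights)
    # Scaling rows by sqrt(weight) turns weighted into ordinary least squares.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], targets * sw, rcond=None)
    return coef[:-1], coef[-1]


# Toy check: targets generated by a known linear rule are recovered.
Z_prime = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = 2.0 * Z_prime[:, 0] - 1.0 * Z_prime[:, 1] + 0.5
x = np.array([1.0, 1.0])
weights = kernel_weights(x, Z_prime, sigma=1.0)
attributions, bias = fit_linear_surrogate(Z_prime, targets, weights)
```

In this toy check the weights are computed in the binary space for brevity, whereas Eq (2) evaluates $\pi_x$ on the recovered vectors z. A positive coefficient means the super feature pushes the prediction toward the explained class, and a negative one pushes it away.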

Globally interpretable model for workload classification
We propose INFO, a model-agnostic interpretable model for workload classification. INFO provides global super features for workloads and provides explanations over all instances by utilizing them. Unlike the locally interpretable model (e.g., LIME) which gives inconsistent interpretation over different instances, INFO gives consistent interpretation over all instances by generating global super features that are applied to the entire subsequence.
For CMD super features, we cluster all n-gram sequences used as CMD features, and define each group as a super feature. The intuition is that clustering the whole CMD features represents the workloads better than clustering a few n-gram sequences. Generating super features from all CMD features increases computational efficiency because the similarity between features is calculated once. In contrast, the naive approach, which is locally interpretable, should calculate the similarity of features for each test instance and thus requires a long running time. We utilize the rank field as a bank-level super feature based on the structure of DRAM. We compose cell-level super features by clustering the adjacent memory blocks. The process of generating vectors for super features and training an interpretable model g is the same as in the naive approach. Unlike the naive method, INFO explains which super feature contains the most representative patterns over different instances. Global super features applied to all subsequences enable the proposed method to give accurate and fast interpretations that are consistent over the classes.

Experiments
In this section, we experimentally evaluate the performance of INFO. We answer the following questions:

Experimental setting
We construct all models using the PyTorch framework. All the models are trained and tested on a machine with a GeForce GTX 1080 Ti GPU.
Dataset. We use two datasets which contain real-world workload sequences summarized in Table 2. There are 40 and 31 workloads in SEC-seq and Memtest86-seq datasets, respectively. The lengths of workload sequences are different in both datasets. Each workload corresponds to a class, and workload subsequences of length 100,000 are used as instances, where the label is the workload that the subsequence belongs to.
Hyperparameter settings. We use three different lengths of n-grams: n = 7, 11, and 15. We set features for the CMD field as the collection of Top-25 n-grams selected from the entire workload. The size of the CMD feature vector corresponds to the total number of selected n-grams. The numbers of 7-gram, 11-gram, and 15-gram sequences used as CMD features in SEC-seq dataset are 154, 236, and 289, respectively. For Memtest86-seq dataset, the numbers of 7-gram, 11-gram, and 15-gram sequences chosen as CMD features are 132, 196, and 215, respectively. We cluster n-gram sequences into 6 groups and use each group as a CMD super feature. We use DRAM with 2 ranks, 4 bank groups within each rank, and 4 banks within each

Performance
We evaluate how faithful and fast INFO is compared to LIME, a locally interpretable model that generates super features individually for each test instance. We adapt LIME to our setting by generating super features for each subsequence as described in the Proposed Method section. In contrast, the proposed INFO generates a unified set of super features which further improves the interpretability. As mentioned in the Experimental setting section, we use Top-1 and Top-3 accuracies as evaluation metrics. Both metrics compare the classification results of a black box model and a linear model. Top-1 accuracy is the ratio of test instances for which the predicted classes of the two models are the same. Top-3 accuracy is the ratio of test instances for which the class predicted by the black box model is among the 3 highest-ranked classes predicted by the linear model. In addition, we compare the running time of INFO and LIME.
We measure the fidelity using 1,000 test instances. To generate interpretable linear models for INFO and LIME, we create 10,000 random samples from each instance and use them for training. We report the running time and the fidelity results of two datasets in Fig 6. INFO is up to 2.0× faster than LIME on the SEC-seq dataset. The performance gap becomes larger as the number of instances increases. For the Memtest86-seq dataset, INFO is also up to 1.5× faster and has higher Top-1 and Top-3 accuracies than LIME.
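The two fidelity metrics can be computed as in the sketch below, assuming the black box outputs and the linear-model scores are available as arrays of shape (instances, classes). The function name and the toy numbers are hypothetical.

```python
import numpy as np


def top_k_fidelity(black_box_probs, surrogate_scores, k):
    """Fraction of instances whose black-box class is in the surrogate's top k."""
    bb_class = np.argmax(black_box_probs, axis=1)
    top_k = np.argsort(-surrogate_scores, axis=1)[:, :k]
    hits = [c in row for c, row in zip(bb_class, top_k)]
    return float(np.mean(hits))


# Two toy instances over four classes.
black_box = np.array([[0.7, 0.2, 0.1, 0.0],
                      [0.1, 0.6, 0.2, 0.1]])
surrogate = np.array([[0.5, 0.3, 0.1, 0.1],
                      [0.4, 0.3, 0.2, 0.1]])
top1 = top_k_fidelity(black_box, surrogate, k=1)
top3 = top_k_fidelity(black_box, surrogate, k=3)
```

In the toy data the surrogate agrees with the black box on the first instance only, but ranks the black-box class second on the other instance, so Top-3 fidelity is higher than Top-1.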

Interpretation
We explain the prediction results of test instances using INFO. Fig 1 explains why two instances of the SEC-seq dataset are correctly classified as class 27. The workload classification model predicts the labels of the two instances as 27 with a probability of 0.99. We show Top-5 super features based on the weights, which refer to the amount of attribution each feature gives to the prediction. Super feature 13 gets the highest weight from both instances, indicating that access patterns of super feature 13 affect the prediction the most. Super features 2 and 11 commonly have high ranks in both instances while super feature 0 has a negative impact on the predictions of the two instances. From the results, we find that cell-level super features are important for workload 27. Fig 7 shows an example of interpretations using two instances from class 9 in the SEC-seq dataset. We report the attribution values of classifying the instances to class 9. Super feature 2 gets the highest weight from both instances, which means the corresponding n-gram sequences have important patterns for workload 9. We find several negative weights from the interpretation of instance 4, which is misclassified as class 8. Note that super features 1, 11, and 3 restrain instance 4 from being classified into class 9.  results of two instances from class 10 in the Memtest86-seq dataset. Instance 8 is misclassified as class 9; thus the interpretation of instance 8 on class 10 has more negative weights than that of instance 7. CMD super features 1, 0, and 3 have the highest weights in both instances while super feature 5 restrains the two instances from being classified into class 10. Note that super feature 4 gets a negative weight on instance 8, which leads to misclassification. Overall, the results of interpretations tell us that bank-level super features do not affect the predictions as much as the others and certain types of super features influence each workload.
Fig 10 compares the interpreted results of two instances using LIME and INFO. For interpretations of LIME, super features 13 and 11 are commonly ranked as Top-5 super features from both instances, but it does not mean that the features within super features 13 and 11 are crucial for class 27. Since LIME generates super features per instance, super features used to explain the predictions of two instances contain different features, resulting in inconsistent explanations. On the other hand, INFO utilizes global super features commonly defined for all instances. Thus, we know that super feature 13 contains the most important patterns for class 27 while super feature 0 has a negative influence on classifying instances to class 27. Table 3 shows an example of CMD super features using LIME and INFO. In LIME, we use two instances from class 27 and 9 in the SEC-seq dataset to show the clustering results

Qualitative analysis of clustering
We provide qualitative analysis for the clustering of n-gram features. We construct the similarity matrix for hierarchical clustering using the interpretability-friendly similarity we proposed. To analyze the clustering result, we visualize clusters using t-SNE, a dimensionality reduction strategy based on the similarity matrix. Fig 11 shows the visualization result where each color represents a super feature. Note that the super features are well separated. We also report an example of CMD super features in Table 4. All the super features contain recurring subsequences. The results show that super features are well-captured based on the similarity measure.

Table 3. Example of CMD super features generated by LIME and INFO. LIME (27) and LIME (9) show the clustering results of two instances from class 27 and class 9 in the SEC-seq dataset, respectively. INFO makes the same super features over all instances while LIME gives inconsistent super features. (Columns: Method, Super feature ID, Example of n-gram sequences.)

Related works
We describe related works on interpretable models.

Local explanation model
Locally interpretable models take only one instance and supply the best answer fitted to a given instance. CAM [3] provides visualized explanations by modifying the pooling layer of convolutional neural networks in the image domain. Selvaraju et al. [4] introduce Grad-CAM, a generalized version of CAM. LIME [2] proposes super-pixels to interpret an image classification model. SHAP [5] proposes a generalized framework of interpretable models based on the concept of game theory. MAPLE [6] gives local explanations by fitting the model directly to a given instance and estimates the prediction of a black box model. Adebayo et al. [7] give local explanations by re-initializing the weights of deep neural networks and introduce a similarity  measure between the explanations based on the weights. Ghalebikesabi et al. [8] propose Neighbourhood Shapley values which improve the local interpretability of Shapley values by weighting features based on a distance metric. Graziani et al. [9] apply LIME to clinical decision-making. DLIME [10] changes random-based perturbation of LIME to a deterministic version using agglomerative hierarchical clustering and k-nearest neighbor. LUC-Locator [11] finds optimal LUCs which are sets of input words sufficient to justify the prediction. SIDU [12] shows visual explanations of a given image addressing a salient region localization issue. ExSum [13] proposes a mathematical method to improve the quality of local explanations. LIMREF [14] provides rule-based local explanations for a particular forecast given by the global forecasting model. However, none of the previous local explanation methods are developed for workload classification. Furthermore, they are limited to local explanations, without giving global explanations. We propose INFO which provides global explanations for workload classification results by exploiting global super features.

Global explanation model
If an interpretable model explains the predictions globally, it makes a consistent response to the entire test set. Since intrinsic models (e.g., decision tree) have interpretability itself, their explanations are viewed as global. Sushil et al. [15] formalize the explanations to generate them globally. GALE [16] proposes an aggregation strategy to provide global explanations. MUSE [17] generates global explanations by defining subspaces of features specified by user interest and making decisions through a black box model. GLocalX [18] attaches an interpretable layer to a black box model and iteratively gathers local explanations to generate global explanations. Jacovi et al. [19] propose contrastive explanations which capture the difference between two representation vectors for text data. CIE [20] generates confident itemsets corresponding to a specific class and provides both instance-wise and class-wise explanations. Asano et al. [21] define two hypersphere sets for a global surrogate model to show high recall and high precision, respectively. Previous approaches for global explanations are proposed only in the text domain. In this work, we propose an interpretability-friendly similarity to generate global super features for workload subsequences.

Conclusion
In this paper, we propose INFO, a globally explainable model which provides fast and accurate interpretations for workload classification. INFO provides consistent interpretations for all workload subsequences through global super features. To generate super features, we propose an interpretability-friendly similarity measure between the raw features for workload classification based on a variant of Jaccard similarity. We generate 3 different types of super features for interpretation: CMD, bank-level, and cell-level super features. For CMD super features, we compute the similarity between n-gram sequences and hierarchically cluster the sequences based on the similarity. We also follow the inherent hierarchy of bank-level fields, i.e., rank, bank group, and bank fields, setting each rank as a bank-level super feature. Since the address field indicates the exact location of a command operation in a particular bank, we divide each bank into memory blocks and group them considering the locality. Each group is used as a cell-level super feature. INFO trains a linear model to approximate a workload classifier and globally explains the classifier utilizing the super features. Experiments show that INFO improves computational efficiency while giving accurate and consistent interpretations which are faithful to the workload classification model.